Companion to Assessing shop completeness in OpenStreetMap for two federalstates in Germany - Analysis of driving factors for shop completeness
The data set used for the analysis contains the following columns:
- populationDensity: inhabitants per km² (for 2018)
- avShops: current number of retail stores as average value from 2018 to 2020
- asymptote: estimated asymptote via nonlinear regression
- completeness: estimated completeness level [%]
- func: best fit function (SSlogis = three parameter logistic function, SSfpl = four parameter logistic, SSasymp = asymptotic function, SSmicmen = Michaelis-Menten function)
- district type:
- 1 = independent city
- 2 = urban district
- 3 = rural district
- unemployments: rate of unemployed in the civilian labor force [%] (for 2017)
- academics: employees to social security contributions at the place of residence with an academic qualification per 100 inhabitants of working age (for 2017)
- GDP: the gross domestic product (GDP) per person in employment [1000 €] (for 2016)
Explorative analysis
# diagram shops and estimated asymptote
p1 <- ggplot(TabFactors, aes(x=avShops, y=asymptote, text=district)) + theme_minimal() + labs(title="number of shops and asymptote") + ylab("asymptote") + xlab("number of shops") +
geom_point(lwd=0.5, alpha=0.8, aes(colour=districtTypeF)) +
geom_abline(slope=1, intercept = 0, col="grey", lty=2) +
theme(plot.title = element_text(size=10, face = "bold"))+ theme(axis.title.y=element_text(size=9)) +theme(axis.title.x=element_text(size=9))+ theme(legend.text = element_text(size = 9)) + theme(axis.text.y = element_text(size = 7), axis.text.x = element_text(size =7)) +
#theme(legend.position = "bottom") +
theme(legend.title = element_text(size = 9, face = "bold"))
ggplotly(p1, Tooltip=c("district"))#violin plot
ggplot(TabFactors, aes(x=districtTypeF, y=completeness, colour = districtTypeF)) + geom_violin(size=0.3) +
scale_fill_brewer(palette="Dark2") + geom_boxplot(width=0.1, size = 0.3)+
#stat_summary(fun.y=median, geom="point", size=2, color="red")+
#stat_summary(fun.y=mean, geom="point", shape=23, size=2)+
labs(title = "completeness distribution by district type", y = "completeness level [%]", x="")+ theme(legend.position = "none") + theme(plot.title = element_text(size=10, face="bold")) + theme(axis.title.y=element_text(size=9)) + theme(legend.text = element_text(size = 9)) + theme(axis.text.y = element_text(size = 7), axis.text.x = element_text(size =9, colour="black"))+
geom_dotplot(binaxis='y', stackdir='center',position=position_dodge(1), binwidth=0.5, colour = "Black")For interpretation on might want to take into account that more urban areas have also a higher unemployment rate:
ggplot(TabFactors, aes(x=districtTypeF, y= unemployments)) + geom_violin()gUenmp <- lm(unemployments ~ districtTypeF, data= TabFactors)
summary(gUenmp)##
## Call:
## lm(formula = unemployments ~ districtTypeF, data = TabFactors)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1286 -0.5714 -0.5348 0.7061 4.4714
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.5286 0.4022 11.258 4.05e-14 ***
## districtTypeFurban districts -0.9938 0.5102 -1.948 0.0583 .
## districtTypeFindependent cities 1.0429 0.6967 1.497 0.1421
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.505 on 41 degrees of freedom
## Multiple R-squared: 0.2109, Adjusted R-squared: 0.1724
## F-statistic: 5.48 on 2 and 41 DF, p-value: 0.007777
However, differences are not strongly significant.
Regression analysis
A poisson GLM estimates significant coefficients for all four predictors considered in the analysis:
g <- glm(asymptote ~ districtTypeF + GDP + unemployments + academics + offset(log(avShops)), data=TabFactors, family=poisson)
## deviance, estimates and significance
1-g$deviance/g$null.deviance## [1] 0.2112687
summary(g)##
## Call:
## glm(formula = asymptote ~ districtTypeF + GDP + unemployments +
## academics + offset(log(avShops)), family = poisson, data = TabFactors)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.6724 -2.4469 -0.8858 1.1783 19.8407
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.2315733 0.0463588 -4.995 5.88e-07 ***
## districtTypeFurban districts 0.0507730 0.0127542 3.981 6.87e-05 ***
## districtTypeFindependent cities -0.0764611 0.0193815 -3.945 7.98e-05 ***
## GDP 0.0040363 0.0005294 7.624 2.47e-14 ***
## unemployments 0.0482830 0.0042124 11.462 < 2e-16 ***
## academics -0.0118853 0.0020594 -5.771 7.87e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1041.39 on 43 degrees of freedom
## Residual deviance: 821.38 on 38 degrees of freedom
## AIC: 1217
##
## Number of Fisher Scoring iterations: 4
## VIF
vif(g)## GVIF Df GVIF^(1/(2*Df))
## districtTypeF 4.030361 2 1.416890
## GDP 2.307922 1 1.519185
## unemployments 2.635076 1 1.623292
## academics 3.706657 1 1.925268
Adjust for overdispersion
However, to account for overdispersion in the data we have to use a negative binomial GLM. This reduces the number of significant predictors due to the additional uncertainty that was (wrongly) ignored by the poisson GLM.
g <- glm.nb(asymptote ~ districtTypeF + GDP + unemployments + academics + offset(log(avShops)), data=TabFactors)
## deviance, estimates and significance
1-g$deviance/g$null.deviance## [1] 0.2581558
summary(g)##
## Call:
## glm.nb(formula = asymptote ~ districtTypeF + GDP + unemployments +
## academics + offset(log(avShops)), data = TabFactors, init.theta = 60.3149139,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9202 -0.6626 -0.2318 0.3660 4.1812
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.356284 0.218597 -1.630 0.10313
## districtTypeFurban districts 0.020293 0.049716 0.408 0.68314
## districtTypeFindependent cities -0.122508 0.094728 -1.293 0.19592
## GDP 0.005399 0.002640 2.045 0.04084 *
## unemployments 0.061954 0.018100 3.423 0.00062 ***
## academics -0.011840 0.009886 -1.198 0.23107
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(60.3149) family taken to be 1)
##
## Null deviance: 59.045 on 43 degrees of freedom
## Residual deviance: 43.802 on 38 degrees of freedom
## AIC: 567.14
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 60.3
## Std. Err.: 13.7
##
## 2 x log-likelihood: -553.14
## VIF
vif(g)## GVIF Df GVIF^(1/(2*Df))
## districtTypeF 2.952698 2 1.310855
## GDP 1.900122 1 1.378449
## unemployments 2.201008 1 1.483579
## academics 2.984513 1 1.727574
Need to simplify model:
g <- update(g, ~. -academics)
summary(g)##
## Call:
## glm.nb(formula = asymptote ~ districtTypeF + GDP + unemployments +
## offset(log(avShops)), data = TabFactors, init.theta = 58.22614057,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7917 -0.6318 -0.2090 0.3476 4.2538
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.335472 0.221280 -1.516 0.1295
## districtTypeFurban districts 0.003967 0.048742 0.081 0.9351
## districtTypeFindependent cities -0.199261 0.072656 -2.743 0.0061 **
## GDP 0.004311 0.002510 1.717 0.0859 .
## unemployments 0.054463 0.017052 3.194 0.0014 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(58.2261) family taken to be 1)
##
## Null deviance: 57.131 on 43 degrees of freedom
## Residual deviance: 43.769 on 39 degrees of freedom
## AIC: 566.55
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 58.2
## Std. Err.: 13.2
##
## 2 x log-likelihood: -554.555
vif(g)## GVIF Df GVIF^(1/(2*Df))
## districtTypeF 1.676818 2 1.137945
## GDP 1.660710 1 1.288685
## unemployments 1.888730 1 1.374311
drop1(g)## Single term deletions
##
## Model:
## asymptote ~ districtTypeF + GDP + unemployments + offset(log(avShops))
## Df Deviance AIC
## <none> 43.769 564.55
## districtTypeF 2 52.484 569.27
## GDP 1 46.724 565.51
## unemployments 1 54.119 572.90
(This is the model used in the manuscript.)
g <- update(g, ~. -GDP)
summary(g)##
## Call:
## glm.nb(formula = asymptote ~ districtTypeF + unemployments +
## offset(log(avShops)), data = TabFactors, init.theta = 54.48131943,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5501 -0.6844 -0.2439 0.3335 4.2223
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.026062 0.076401 0.341 0.73301
## districtTypeFurban districts 0.008873 0.050061 0.177 0.85932
## districtTypeFindependent cities -0.140452 0.065857 -2.133 0.03295 *
## unemployments 0.037443 0.014481 2.586 0.00972 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(54.4813) family taken to be 1)
##
## Null deviance: 53.677 on 43 degrees of freedom
## Residual deviance: 43.894 on 40 degrees of freedom
## AIC: 567.42
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 54.5
## Std. Err.: 12.3
##
## 2 x log-likelihood: -557.421
vif(g)## GVIF Df GVIF^(1/(2*Df))
## districtTypeF 1.278499 2 1.063347
## unemployments 1.278499 1 1.130707
1 - g$deviance / g$null.deviance## [1] 0.1822509
Compare with model with GDP instead of unemployment rate
g2 <- glm.nb(formula = asymptote ~ districtTypeF + GDP + offset(log(avShops)), data = TabFactors, link = log)
summary(g2)##
## Call:
## glm.nb(formula = asymptote ~ districtTypeF + GDP + offset(log(avShops)),
## data = TabFactors, link = log, init.theta = 46.79504818)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2052 -0.6465 -0.2734 0.2403 4.8963
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.2297373 0.1553730 1.479 0.139
## districtTypeFurban districts -0.0331359 0.0527921 -0.628 0.530
## districtTypeFindependent cities -0.1034678 0.0727771 -1.422 0.155
## GDP -0.0004187 0.0022939 -0.183 0.855
##
## (Dispersion parameter for Negative Binomial(46.795) family taken to be 1)
##
## Null deviance: 46.498 on 43 degrees of freedom
## Residual deviance: 44.057 on 40 degrees of freedom
## AIC: 573.9
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 46.8
## Std. Err.: 10.5
##
## 2 x log-likelihood: -563.9
Clearly worst, so we stick with g. Are were interactions?
g2 <- glm.nb(formula = asymptote ~ districtTypeF * unemployments + offset(log(avShops)), data = TabFactors, link = log)
summary(g2)##
## Call:
## glm.nb(formula = asymptote ~ districtTypeF * unemployments +
## offset(log(avShops)), data = TabFactors, link = log, init.theta = 55.99395915)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7989 -0.6115 -0.2448 0.2852 3.9883
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -0.01953 0.08769 -0.223
## districtTypeFurban districts 0.15718 0.14882 1.056
## districtTypeFindependent cities -0.01963 0.24440 -0.080
## unemployments 0.04732 0.01728 2.738
## districtTypeFurban districts:unemployments -0.03902 0.03730 -1.046
## districtTypeFindependent cities:unemployments -0.02340 0.04344 -0.539
## Pr(>|z|)
## (Intercept) 0.82373
## districtTypeFurban districts 0.29090
## districtTypeFindependent cities 0.93598
## unemployments 0.00618 **
## districtTypeFurban districts:unemployments 0.29544
## districtTypeFindependent cities:unemployments 0.59021
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(55.994) family taken to be 1)
##
## Null deviance: 55.075 on 43 degrees of freedom
## Residual deviance: 43.816 on 38 degrees of freedom
## AIC: 570.21
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 56.0
## Std. Err.: 12.6
##
## 2 x log-likelihood: -556.213
Nope.
Diagnostics
plot(g, which=1:6, labels.id = TabFactors$district)Nordsachsen is highly influential, characterized by a high cook’s distance and a high leverage. Görlitz is also remarkable due to its high leverage.
What if we leave out Nordsachsen?
#TabFactors$district
gPart <- glm.nb(asymptote ~ unemployments + offset(log(avShops)), data=TabFactors, subset= district != "Nordsachsen, rural district" )
## deviance, estimates and significance
1-gPart$deviance/gPart$null.deviance## [1] 0.008840667
summary(gPart)##
## Call:
## glm.nb(formula = asymptote ~ unemployments + offset(log(avShops)),
## data = TabFactors, subset = district != "Nordsachsen, rural district",
## init.theta = 121.1133921, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4628 -0.7879 -0.2650 0.6885 2.8794
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.167831 0.041805 4.015 5.95e-05 ***
## unemployments -0.005804 0.009459 -0.614 0.539
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(121.1134) family taken to be 1)
##
## Null deviance: 43.223 on 42 degrees of freedom
## Residual deviance: 42.841 on 41 degrees of freedom
## AIC: 518.7
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 121.1
## Std. Err.: 29.8
##
## 2 x log-likelihood: -512.697
If we drop Görlitz as another district with high leverage, the relationship with unemployment holds.
gPart <- glm.nb(asymptote ~ unemployments + offset(log(avShops)), data=TabFactors, subset= district != "Nordsachsen, rural district" | district != "Görlitz, rural district")
## deviance, estimates and significance
1-gPart$deviance/gPart$null.deviance## [1] 0.07963517
summary(gPart)##
## Call:
## glm.nb(formula = asymptote ~ unemployments + offset(log(avShops)),
## data = TabFactors, subset = district != "Nordsachsen, rural district" |
## district != "Görlitz, rural district", init.theta = 48.01488863,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2442 -0.6825 -0.2325 0.3200 4.4812
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.05807 0.06134 0.947 0.3438
## unemployments 0.02571 0.01361 1.890 0.0588 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(48.0149) family taken to be 1)
##
## Null deviance: 47.645 on 43 degrees of freedom
## Residual deviance: 43.851 on 42 degrees of freedom
## AIC: 568.62
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 48.0
## Std. Err.: 10.7
##
## 2 x log-likelihood: -562.621
If we only drop Görlitz
gPart <- glm.nb(asymptote ~ districtTypeF + unemployments + offset(log(avShops)), data=TabFactors, subset= district != "Nordsachsen, rural district" | district != "Görlitz, rural district")
## deviance, estimates and significance
1-gPart$deviance/gPart$null.deviance## [1] 0.1822509
summary(gPart)##
## Call:
## glm.nb(formula = asymptote ~ districtTypeF + unemployments +
## offset(log(avShops)), data = TabFactors, subset = district !=
## "Nordsachsen, rural district" | district != "Görlitz, rural district",
## init.theta = 54.48131943, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5501 -0.6844 -0.2439 0.3335 4.2223
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.026062 0.076401 0.341 0.73301
## districtTypeFurban districts 0.008873 0.050061 0.177 0.85932
## districtTypeFindependent cities -0.140452 0.065857 -2.133 0.03295 *
## unemployments 0.037443 0.014481 2.586 0.00972 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(54.4813) family taken to be 1)
##
## Null deviance: 53.677 on 43 degrees of freedom
## Residual deviance: 43.894 on 40 degrees of freedom
## AIC: 567.42
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 54.5
## Std. Err.: 12.3
##
## 2 x log-likelihood: -557.421
Coefficient estimates stay the same if Görlitz is excluded or not.
Districts for those no reliable saturation level could be estimated
gSat <- glm(noSaturationEstimated ~ districtTypeF + unemployments + GDP, data=TabFactorsAll, family=binomial)
summary(gSat)##
## Call:
## glm(formula = noSaturationEstimated ~ districtTypeF + unemployments +
## GDP, family = binomial, data = TabFactorsAll)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0738 -0.6984 -0.6634 -0.5518 1.8746
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.735336 3.266384 -0.531 0.595
## districtTypeFurban districts 0.398078 0.878130 0.453 0.650
## districtTypeFindependent cities 0.944409 0.989588 0.954 0.340
## unemployments 0.104953 0.272542 0.385 0.700
## GDP -0.005128 0.035826 -0.143 0.886
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 61.210 on 56 degrees of freedom
## Residual deviance: 59.523 on 52 degrees of freedom
## AIC: 69.523
##
## Number of Fisher Scoring iterations: 4
drop1(gSat)## Single term deletions
##
## Model:
## noSaturationEstimated ~ districtTypeF + unemployments + GDP
## Df Deviance AIC
## <none> 59.523 69.523
## districtTypeF 2 60.452 66.452
## unemployments 1 59.670 67.670
## GDP 1 59.544 67.544
gSat <- update(gSat, ~.- districtTypeF)
summary(gSat)##
## Call:
## glm(formula = noSaturationEstimated ~ unemployments + GDP, family = binomial,
## data = TabFactorsAll)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9813 -0.7163 -0.6530 -0.6292 1.8313
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.593145 2.852478 -0.909 0.363
## unemployments 0.182698 0.214270 0.853 0.394
## GDP 0.008218 0.031821 0.258 0.796
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 61.210 on 56 degrees of freedom
## Residual deviance: 60.452 on 54 degrees of freedom
## AIC: 66.452
##
## Number of Fisher Scoring iterations: 4
drop1(gSat)## Single term deletions
##
## Model:
## noSaturationEstimated ~ unemployments + GDP
## Df Deviance AIC
## <none> 60.452 66.452
## unemployments 1 61.169 65.169
## GDP 1 60.517 64.517
gSat <- update(gSat, ~.- unemployments)
summary(gSat)##
## Call:
## glm(formula = noSaturationEstimated ~ GDP, family = binomial,
## data = TabFactorsAll)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7560 -0.7280 -0.7152 -0.6779 1.8189
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.819006 2.003200 -0.409 0.683
## GDP -0.005747 0.028487 -0.202 0.840
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 61.210 on 56 degrees of freedom
## Residual deviance: 61.169 on 55 degrees of freedom
## AIC: 65.169
##
## Number of Fisher Scoring iterations: 4
drop1(gSat)## Single term deletions
##
## Model:
## noSaturationEstimated ~ GDP
## Df Deviance AIC
## <none> 61.169 65.169
## GDP 1 61.210 63.210
ggplot(TabFactorsAll, aes(x=noSaturationEstimated, y= GDP)) + geom_boxplot(varwidth=TRUE) + facet_wrap(~districtTypeF)ggplot(TabFactorsAll, aes(x=noSaturationEstimated, y= unemployments)) + geom_boxplot(varwidth=TRUE) + facet_wrap(~districtTypeF)